Text Chunking by System Combination

Author

  • Erik F. Tjong Kim Sang
Abstract

Tjong Kim Sang (2000) describes how a system-internal combination of memory-based learners can be used for base noun phrase (baseNP) recognition. The idea is to generate different chunking models by using different chunk representations. Chunks can be represented with bracket structures, but alternatively one can use a tagging representation which classifies words as being inside a chunk (I), outside a chunk (O), or at a chunk boundary (B) (Ramshaw and Marcus, 1995). There are four variants of this representation. The B tags can be used for the first word of chunks that immediately follow another chunk (the IOB1 representation), or they can be used for every chunk-initial word (IOB2). Alternatively, an E tag can be used for labeling the final word of a chunk immediately preceding another chunk (IOE1), or it can be used for every chunk-final word (IOE2). Bracket structures can also be represented as tagging structures by using two streams of tags which define whether words start a chunk or not (O) and whether words are at the end of a chunk or not (C). We need both streams for encoding the phrase structure and hence we will treat the two tag streams as a single representation (O+C). A combination of baseNP classifiers that use the five representations performs better than any of the included systems (Tjong Kim Sang, 2000). We will apply such a classifier combination to the CoNLL-2000 shared task. The individual classifiers will use the memory-based learning algorithm IB1-IG (Daelemans et al., 1999) for determining the most probable tag for each word. In memory-based learning the training data is stored, and a new item is classified with the most frequent classification among the training items closest to it. Data items are represented as sets of feature-value pairs. Features receive weights based on the amount of information they provide for classifying the training data (Daelemans et al., 1999).
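The four IOB/IOE variants can be illustrated with a short sketch. The function name and the span-list input format below are illustrative assumptions, not from the paper; chunks are given as non-overlapping, sorted (start, end) word spans with exclusive ends.

```python
def to_tags(chunks, n, variant):
    """Encode baseNP chunks over n words in one of the four tagging variants.

    chunks: sorted, non-overlapping (start, end) word spans, end exclusive.
    Words outside every chunk receive the tag O.
    """
    tags = ["O"] * n
    prev_end = None
    for i, (start, end) in enumerate(chunks):
        for pos in range(start, end):
            tags[pos] = "I"
        if variant == "IOB2":
            tags[start] = "B"               # B on every chunk-initial word
        elif variant == "IOB1" and prev_end == start:
            tags[start] = "B"               # B only when a chunk directly follows a chunk
        if variant == "IOE2":
            tags[end - 1] = "E"             # E on every chunk-final word
        elif variant == "IOE1":
            next_start = chunks[i + 1][0] if i + 1 < len(chunks) else None
            if next_start == end:
                tags[end - 1] = "E"         # E only when another chunk directly follows
        prev_end = end
    return tags
```

For example, the chunk spans [(0, 2), (2, 3), (5, 6)] over seven words yield ["I", "I", "B", "O", "O", "I", "O"] under IOB1 but ["B", "I", "B", "O", "O", "B", "O"] under IOB2, showing how the two variants disagree only on chunk-initial words that do not follow another chunk.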
We will evaluate nine different methods for combining the output of our five chunkers (Van Halteren et al., 1998). Five are so-called voting methods. They assign weights to the output of the individual systems and use these weights to determine the most probable output tag. Since the classifiers generate different output formats, all classifier output has been converted to the O and the C representations. The simplest voting method assigns uniform weights and picks the tag that occurs most often (Majority). A more advanced method uses as a weight the accuracy of the classifier on a held-out part of the training data, the tuning data (TotPrecision). One can also use the precision obtained by a classifier for a specific output value as a weight (TagPrecision). Alternatively, we use as a weight the precision score for the output tag combined with the recall scores for competing tags (PrecisionRecall). The most advanced voting method examines the output values of pairs of classifiers and assigns weights to tags based on how often they appear with each pair in the tuning data (TagPair; Van Halteren et al., 1998).
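The first two voting schemes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function name and input format are assumptions. Uniform weights give Majority voting, while weights set to each classifier's tuning-set accuracy give TotPrecision voting.

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """Combine per-word tag predictions from several classifiers.

    predictions: one tag sequence per classifier, all of equal length.
    weights: one weight per classifier; None means uniform (Majority voting).
    """
    if weights is None:
        weights = [1.0] * len(predictions)  # Majority: every vote counts equally
    combined = []
    for word_tags in zip(*predictions):     # tags proposed for one word
        score = defaultdict(float)
        for tag, w in zip(word_tags, weights):
            score[tag] += w                 # accumulate weighted votes per tag
        combined.append(max(score, key=score.get))
    return combined
```

With three classifiers predicting [["I", "O"], ["I", "I"], ["O", "I"]], uniform weights yield ["I", "I"]; giving the third classifier a high weight such as [0.5, 0.2, 0.9] flips the first word to "O". TagPrecision, PrecisionRecall, and TagPair would replace the single per-classifier weight with per-tag or per-classifier-pair weights estimated on the tuning data.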


Similar Articles

Fast Boosting-based Part-of-Speech Tagging and Text Chunking with Efficient Rule Representation for Sequential Labeling

This paper proposes two techniques for fast sequential labeling such as part-of-speech (POS) tagging and text chunking. The first technique is a boosting-based algorithm that learns rules represented by combinations of features. To avoid time-consuming evaluation of combinations, we divide features into unused ones and used ones for learning combinations. The other technique is a rule representation. Usua...


Efficient text chunking using linear kernel with masked method

In this paper, we propose an efficient and accurate text chunking system using a linear SVM kernel and a new technique called the masked method. Previous research indicated that system combination or external parsers can enhance chunking performance. However, the cost of constructing multiple classifiers is even higher than developing a single one. Moreover, the use of external resources w...


A Fast Boosting-based Learner for Feature-Rich Tagging and Chunking

Combinations of features contribute to a significant improvement in accuracy on tasks such as part-of-speech (POS) tagging and text chunking, compared with using atomic features. However, selecting combinations of features when learning from large-scale and feature-rich training data requires long training times. We propose a fast boosting-based algorithm for learning rules represented by combinati...


A Text Chunker and Hybrid POS Tagger for Indian Languages

Part-of-Speech (POS) tagging can be described as the task of automatically annotating each word in a text document with its syntactic category. This paper presents a generic hybrid POS tagger for Indian languages. Indian languages are relatively free-word-order, morphologically productive, and agglutinative. In this hybrid implementation we have used a combination of statistical approach...


Chunking Clinical Text Containing Non-Canonical Language

Free text notes typed by primary care physicians during patient consultations typically contain highly non-canonical language. Shallow syntactic analysis of free text notes can help to reveal valuable information for the study of disease and treatment. We present an exploratory study into chunking such text using off-the-shelf language processing tools and pre-trained statistical models. We eval...


Determining the Boundaries and Types of Syntactic Phrases in Persian Texts

Text tokenization is the process of splitting text into meaningful tokens such as words, phrases, sentences, etc. Tokenization of syntactic phrases, known as chunking, is an important preprocessing step needed in many applications such as machine translation, information retrieval, text-to-speech, etc. In this paper, chunking of Farsi texts is done using statistical and learning methods, and the grammat...



Publication date: 2000